Extractor API #16
Simplify directory structure
What should happen if the extractor thinks it can handle the source but then can't?
What about having extractors report a confidence between 0 and 100 that they can extract the source?
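A rough sketch of what confidence reporting could look like. Every name here (`ScoredExtractor`, `confidence`, `rankExtractors`) and both scoring rules are hypothetical, and a plain URL string stands in for the source to keep the example small:

```typescript
// Hypothetical shape; none of these names come from the PR itself.
interface ScoredExtractor {
  name: string;
  // 0 means "cannot extract", 100 means "certain".
  confidence(url: string): number;
}

// Returns the extractors worth trying, most confident first.
function rankExtractors(extractors: ScoredExtractor[], url: string): ScoredExtractor[] {
  return extractors
    .map((extractor) => ({ extractor, score: extractor.confidence(url) }))
    .filter((entry) => entry.score > 0)
    .sort((a, b) => b.score - a.score)
    .map((entry) => entry.extractor);
}

// Illustrative extractors with made-up scoring rules.
const wordpressRest: ScoredExtractor = {
  name: "wordpress-rest",
  confidence: (url) => (url.includes("/wp-json/") ? 90 : 0),
};
const genericFallback: ScoredExtractor = {
  name: "generic",
  confidence: () => 10,
};
```

Returning a ranked list rather than a single winner would also give us something to fall back on when the most confident extractor turns out not to find anything.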
Should an extractor extract a single type of data?
I think it should not be limited to a single type of data since parsing the DOM might give multiple pieces of data as side-effects. For example, followers and follows might be on the same page and thus could be extracted at the same time.
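To illustrate, one possible result shape lets a single pass fill in any subset of data types. The type names and the `ProfilePage` stand-in below are assumptions for the sake of the example, not part of the proposed API:

```typescript
// Hypothetical data-type names, taken from the example in this comment.
type DataType = "followers" | "follows";

// One extraction pass may fill in any subset of the data types.
type ExtractionResult = Partial<Record<DataType, string[]>>;

// Stand-in for an already parsed page; not the real DOM.
interface ProfilePage {
  followerNames: string[];
  followNames: string[];
}

// Collects every data type the page happens to contain in a single pass.
function extractProfile(page: ProfilePage): ExtractionResult {
  const result: ExtractionResult = {};
  if (page.followerNames.length > 0) result.followers = [...page.followerNames];
  if (page.followNames.length > 0) result.follows = [...page.followNames];
  return result;
}
```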
What concerns me more is how we'll register extractors, considering that we might end up with a large number of them. Maybe we can have a two-stage system where we'd only add a subset of extractors based on URL matching. WordPress, for example, could be added for all URLs except those that specifically match other matchers.
That way we wouldn't be registering all extractors on every page (since the content script loads into every page).
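Something along these lines, as a minimal sketch of the two-stage idea. `ExtractorEntry`, `selectExtractors`, the `example-social` entry and its URL pattern are all made up for illustration:

```typescript
// Hypothetical registry entry; an empty pattern list marks a fallback extractor.
interface ExtractorEntry {
  name: string;
  patterns: RegExp[];
}

// Stage one: keep extractors whose URL patterns match.
// Stage two: if none match, use the fallback extractors instead.
function selectExtractors(entries: ExtractorEntry[], url: string): ExtractorEntry[] {
  const specific = entries.filter((entry) =>
    entry.patterns.some((pattern) => pattern.test(url)),
  );
  if (specific.length > 0) return specific;
  return entries.filter((entry) => entry.patterns.length === 0);
}

// Illustrative registry; the social site and its URL are invented.
const registry: ExtractorEntry[] = [
  { name: "example-social", patterns: [/^https:\/\/example\.social\//] },
  { name: "wordpress", patterns: [] },
];
```

With this, only the entries returned by `selectExtractors` would need to be registered on a given page.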
What about multi-language sources? We don't need to support it right away, but we should keep it in mind while designing the API.
👍
 * Source of data to be extracted, like a DOM document, a URL or any other kind of resource.
 * For the moment, only DOM Document is supported.
Could you elaborate on these other kinds of resources? The URL can be accessed through `document.location.href`.
The idea behind having a `Source` interface is that the API does not depend on a specific data structure that might not be available in all runtimes. If in the future we would like to run an extractor in `nodejs`, for example, the `document` would not exist (or at least would not have the same type).
Another reason is that we can envision having extractors that don't rely on a `document` but instead, for example, pull directly from a URL. (We could also make it so that an extractor can support multiple types of `Sources`, e.g. `DOMSource` and `URLSource`.)
If we do not introduce the notion of a `Source` at this moment, adding support later for multiple types of sources would be a breaking change to the API, which would require updating all existing extractors.
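To make this concrete, here is a minimal sketch of how the two source types could coexist. Only `Source`, `DOMSource` and `URLSource` are names from this discussion; the `kind` discriminant, `DocumentLike`, `SourceExtractor` and `titleExtractor` are assumptions made for the example:

```typescript
// Stand-in for the browser's Document so the sketch also compiles outside a browser.
interface DocumentLike {
  title: string;
}

interface DOMSource {
  kind: "dom";
  document: DocumentLike;
}

interface URLSource {
  kind: "url";
  url: string;
}

// Adding a new source type later means extending this union,
// not changing the extractor signature.
type Source = DOMSource | URLSource;

// Hypothetical extractor shape that receives any Source.
interface SourceExtractor {
  name: string;
  supports(source: Source): boolean;
  extractTitle(source: Source): string | undefined;
}

// Illustrative extractor that only understands DOM sources.
const titleExtractor: SourceExtractor = {
  name: "dom-title",
  supports: (source) => source.kind === "dom",
  extractTitle: (source) =>
    source.kind === "dom" ? source.document.title : undefined,
};
```

An extractor that only understands one kind of source can simply decline the others, so existing extractors would not need to change when a new source type is added.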
This is something I hadn't considered. I think probably we should only run the content scripts if the extension is currently open.
I will open a new PR to implement this when required.
WIP
Doubts
Naming
Need better names for:
SiteData
SiteInfo
What should happen if the extractor thinks it can handle the source but then can't?
I guess we'd tell the user something like "Extractor didn't find anything, try another one"? Or, instead of having the extractor say whether it supports a source, should we ask the user right away which extractor they want to use?
Should an extractor extract a single type of data?
For example, should the `wordpress-rest` extractor extract posts and pages, or should there be a `wordpress-post-rest` and a `wordpress-page-rest` extractor?
Having different extractors for different kinds of data would unlock the possibility of having specific extractors for certain data types. For example, there could be a `wordpress-product-rest` that extracts products from an ecommerce site.
What about multi-language sources?
We don't need to support it right away, but we should keep it in mind while designing the API.